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ABSTRACT 

We have developed a special-purpose computer for gravitational many-body simula- 
tions, GRAPE-5. GRAP15-5 is the successor of GRAPE-3. Both consist of eight custom 
pipeline chips (G5 chip and GRAPE chip). The difference between GRAPE-5 and 
GRAPE-3 are: (1) The G5 chip contains two pipelines operating at 80 MHz, while the 
GRAPE chip had one at 20 MHz. Thus, the calculation speed of the G5 chip and that of 
GRAPE-5 board are 8 times faster than that of GRAPE chip and GRAPE-3 board. (2) 
The GRAPE-5 board adopted PCI bus as the interface to the host computer instead of 
VME of GRAPE-3, resulting in the communication speed one order of magnitude faster. 
(3) In addition to the pure 1/r potential, the G5 chip can calculate forces with arbitrary 
cutoff functions, so that it can be applied to Ewald or P^M methods. (4) The pairwise 
force calculated on GRAPE-5 is about 10 times more accurate than that on GRAPE-3. 
On one GRAPE-5 board, one timestep of 128k-body simulation with direct summation 
algorithm takes 14 seconds. With Barnes- Hut tree algorithm [9 = 0.75), one timestep 
of 10^-body simulation can be done in 16 seconds. 

Subject headings: many-body simulation — Numerical methods — Stars: stellar dy- 
namics — Cosmology 
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1. Introduction 



GRAPE-3 Processor Board 



In this paper, we describe the hardware architec- 
ture and performance of GRAPE-5, the newest addi- 
tion to GRAPE series of special-purpose computers. 

GRAPE (for "GRAvity PipE"; see Sugimoto et 
al. 1990, Makino and Taiji 1998) is a special-purpose 
computer to accelerate the gravitational many-body 
simulation. It has pipelines specialized for the force 
calculation, which is the most expensive part of the 
gravitational many-body simulation. All other calcu- 
lations, such as time integration of orbits, are per- 
formed on a host computer connected to GRAPE. 
Figure |] illustrates the basic structure of a GRAPE 
system. In the simplest case, the host computer sends 
positions and masses of all particles to GRAPE. Then 
GRAPE calculates the forces between particles, and 
sends them back to the host computer. The GRAPE 
system achieved high performance on the gravita- 
tional many-body simulation through pipelined and 
highly parallelized architecture specialized for the 
force calculation. 
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Fig. 2. — Block diagram of the GRAPE-3 processor 
board. 



GRAPE-5 is the successor of GRAPE-3. GRAPE- 
5 embodies the improvement of a factor of 10 in cal- 
culation speed, communication speed, and accuracy, 
over GRAPE-3. In addition, it can be applied to 
Ewald or P^^M algorithms, since it can evaluate the 
force and potential with arbitrary cutoff. 

The structure of this paper is as follows: In sec- 
tion ^ we briefly describe force calculation algorithms 
used with GRAPE-5. In section || we describe the 
hardware of the GRAPE-5 system. In section ^ we 
discuss the calculation accuracy of GRAPE-5. In sec- 
tion ^ we present the timing results. In section |6| we 
discuss future prospects. 



Fig. 1. — Basic structure of GRAPE system. 

GRAPE-3 (Okumura et al. 1993) is the first 
GRAPE system with multiple force calculation pipelines. 
Figure ^ shows the architecture of GRAPE-3A board 
with 8 pipelines. One pipeline is integrated into the 
GRAPE chip. The peak performance of the GRAPE 
chip is 0.6 Gflops at a clock cycle of 20 MHz, and 
that of a system which contains 8 GRAPE chips is 
4.8 Gflops. Like GRAPE-1 (Ito et al. 1990) and 
GRAPE-IA (Fukushige et al. 1991), GRAPE-3 uses 
the number format with short mantissa, in order to 
reduce the chip size. The r.m.s. relative error of the 
pairwise force is about 2%, which is low enough for 
most of coUisionless simulations (Makino 1990, Hern- 
quist et al. 1993, Athanassoula et al. 1998). Copies 
of GRAPE-3 are used in many institute inside and 
outside Japan {e.g. Brieu et al. 1995, Padmanabhan 
et al. 1996, Steinmetz 1996, Nakasato et al. 1997, 
Klessen et al. 1998, Mori et al. 1999, Theis and 
Spurzem 1999, Koda et al. 1999). 



2. Force Calculation Algorithms 

In this section we briefly discuss force calcula- 
tion algorithms used on GRAPE-5. The 0{N'^) di- 
rect summation is the simplest algorithm. To obtain 
the force on a particle, GRAPE-5 simply calculates 
and adds up forces from all particles in the system. 
Though the direct summation is simple and efficient 
for small- (say N <10^) systems, for large- sys- 
tems force calculation becomes expensive, even with 
GRAPE hardware. In the following, we describe two 
algorithms to reduce the calculation cost. First we 
describe the Barnes-Hut tree algorithm (Barnes and 
Hut 1986), and then P^M (particle-particle particle- 
mesh) method (Hockney and Eastwood 1981). 

2.1. Barnes-Hut Tree Algorithm 

The Barnes-Hut tree algorithm (Barnes and Hut 
1986) reduces the calculation cost from 0{N'^) to 
O(A^logA^), by replacing forces from distant parti- 
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cles by that from a particle at their center of mass (or 
multipole expansion) . Vectorization of tree algorithm 
is discussed in Barnes (1990), Hernquist (1990) and 
Makino (1990). Parallelization has been discussed in 
numerous articles (Salmon and Warren 1992, Salmon 
et al. 1994, Dubinski 1996, Warren et al. 1997, Ya- 
hagi et al. 1999). 

The application of GRAPE hardwares to the tree 
algorithm are discussed in Makino (1991) and Athanas- 
soula et aZ.(1998). Vectorized versions of tree algo- 
rithms based on Barnes' modified algorithm (1990) 
are used in these articles. With this algorithm, we 
can use GRAPE more efficiently than with the origi- 
nal algorithm. In the original algorithm, tree traver- 
sal is performed for each particle. In the modified 
algorithm, tree traversal is performed for a group of 
neighboring particles and an interaction list is cre- 
ated. GRAPE calculates the force from particles and 
nodes in this interaction list to particles in the group. 

The modified tree algorithm reduces the calcula- 
tion cost of the host computer by roughly a factor 
of Ug, where Ug is the average number of particles 
in groups. On the other hand, the amount of work 
on GRAPE increases as we increase rig, since the in- 
teraction list becomes longer. There is, therefore, an 
optimal value of Ug at which the total computing time 
is minimum (Makino 1991). The optimal value of Ug 
depends on various factors, such as the relative speed 
of GRAPE and its host computer, the opening param- 
eter and number of particles. For present GRAPE-5, 
ng = 2000 is close to optimal. 

When we need high accuracy for the force calcu- 
lation, we can use P^M^ (pseudo-particle multipole 
method; Makino 1998). In the algorithm described 
above, we can not handle higher-order terms of the 
multipole expansion on GRAPE, since GRAPE can 
calculate 1/r^ force only. Thus, the amount of the 
calculation grows quickly when high accuracy is re- 
quired. In P^M^, high-order expansion terms are 
expressed by forces from a small number of pseudo- 
particles, and thus we can evaluate these terms on 
GRAPE. Using P^M^, Kawai and Makino (1999) im- 
plemented arbitrary-order tree algorithm on GRAPE- 
4. The timing results show that the calculation with 
P^M^ is faster than that without P^M^, when the 
total force error smaller than ~ 10^* is required. 



2.2. P^M Method 

The P^M method (Hockney and Eastwood 1981) 
calculates gravitational interaction under periodic bound- 
ary condition. The total force is divided into long- 
range and short-range forces. The long-range (PM) 
force is computed in wave number space using fast 
Fourier transform (FFT). The short-range (PP) force 
is calculated directly. 

Since the PM part takes care of the long-range in- 
teraction, the force calculation in the PP part is not a 
pure 1/r^ gravity but with cutoff. For example, Efs- 
tathiou and Eastwood (1981) used the following form 
as the PP force exerted from a particle located at r 



„ . ,. mir — r') , , 



\r — r 



(1) 



where m is the mass of the particle and r' is the 
position at which the force is evaluated. The cutoff 
<?P3M is expressed as 



5P3m(-R) 



l-^(224i?3 



224i?5 -f 70i?^ 
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l-Y^(12 - 224i?2 
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for 1< i? < 2 



L for i? > 2, 

(2) 

where R = \r — rj\/r], and 77 is a scale length. If 
higher accuracy is desired, one can use a Gaussian 
cutoff (Ewald 1921). GRAPE-5 can calculate these 
forces using its programmable cutoff table. 

Brieu et al. (1995) implemented the P'^M method 
on GRAPE-3. Since GRAPE-3 can calculate force 
with Plummer softening only, they expressed the PP 
force as a combination of three forces with differ- 
ent Plummer softening radius. Therefore they used 
GRAPE-3 three times to evaluate one PP force. This 
procedure is rather complex, and yet the accuracy is 
rather low. GRAPE-5 evaluates this interaction with 
cutoff in single call, and in high accuracy. 

3. Hardware of GRAPE-5 System 

In this section we describe the function and the 



architecture of GRAPE-5 system. In section 3.1 we 
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show the overall architecture of the GRAPE-5 system 
and overview the function of the GRAPE-5 processor 
board. In section 3.2 we give a detailed description 
of the G5 chip, a custom chip for force calculation. 



GRAPE-5 Processor Board 



Sections 3.3-3.6 are devoted to description of compo- 
nents of the processor board other than the G5 chip. 
Peak performance and other miscellaneous aspects of 



the board are described in section 3.7 



3.1. Overall Architecture 

Figure ^ shows the basic structure of the GRAPE-5 
system. It consists of three components: a GRAPE-5 
processor board, a PCI Host Interface Board (Kawai 
et al. 1997), and the host computer. The processor 
board is connected to PCI Host Interface Board (here- 
after we call PHIB) via Hhnk (Makino et al. 1997). 
PHIB is attached to PCI I/O bus of the host com- 
puter. PCI bus is an I/O bus standard widely used 
from desktop PCs to supercomputers. 

Host Computer 



PHIB 



PCI Bus 



HIink 



GRAPE-5 
Processor Board 



Fig. 3. — Basic structure of the GRAPE-5 system. 

Figure ^ shows the block diagram of the GRAPE- 
5 processor board. It consists of five units: the G5 
chips, the memory unit, the particle index unit, the 
neighbor list unit, and the interface unit. The G5 chip 
is a custom VLSI chip which integrates two pipeline 
processors to calculate gravitational interactions. The 
memory unit supplies particle data to the G5 chips. 
The particle index unit supplies indices of particles to 
the memory unit during force calculation. This unit 
can supply indices in a special manner optimized to 
the cell-index method to calculate a short-range force, 
such as the PP force in P'^M method. The neighbor 
list unit constructs the list of the nearest neighbor 
particles. The interface unit handles the communica- 
tion with the host computer. In the following subsec- 
tions we describe these units. 

3.2. The G5 chip 
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Fig. 4. — Block diagram of the GRAPE-5 processor 
board. 



3.2.1. Function 

The G5 chip calculates the force exerted on particle 
i at position r^. The force /j is expressed as 



N 



(3) 



where N is the number of particles and 



f ij 







for 
for 



(4) 

Here is the force (per unit mass) from particle j to 
particle i. Note that Vj and rrij are the position and 
the mass of particle j, and that rg^ij is the softened 
distance between particle i and j defined as = 
\ri — rj\'^ + ef, where is the softening parameter for 
particle i. Here and hereafter, we use index i for the 
particle on which the force is evaluated and index j for 
particles that exert forces on particle i. The function 
g is an arbitrary cutoff function, ?7 is a scale length 
for the cutoff function, and rcut is the cutoff length. 
The G5 chip also calculates potential associated 
with force / j , using cutoff function different from that 
for force calculation. During force calculation. The 
G5 chip asserts the neighbor flag if the distance of 
particles rgjj is less than a given neighbor radius, hi. 

3.2.2. Overall architecture 

The G5 chip consists of two pipeline units and one 
I/O unit, as shown in figure ^. The pipeline unit 
calculates the gravitational interaction. The I/O unit 
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handles data transfer between the pipchnc unit and 
the external I/O port. 

G5 Chip 
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Fig. 5. — Block diagram of the GRAPE-5 processor 
chip (G5 chip). 

Figure || shows the block diagram of the pipeline 
unit. The number attached to each line is the num- 
ber of bits used. The number format will be discussed 
later. The position vector rj and the mass data rrij 
are supplied from the memory unit. The position vec- 
tor Ti, the softening parameter e;, and the neighbor 
radius hi are stored in the on-chip register before the 
pipeline starts calculation. The pipeline unit calcu- 
lates one interaction per clock cycle, and accumulate 
it to on-chip registers. The pipeline unit outputs the 
neighbor flag when < hf. The function generator 
calculates 1/ g{rs,ij /rj) and 1/ g^{rs,ij /rj) from r^ij and 
rf . Figure shows the block diagram of the function 
generator. The cutoff functions g and are evalu- 
ated by table lookup. The tables arc implemented as 
on-chip RAM blocks. 



G5 chip has the virtual multiple pipeline architec- 
ture (Makino et al. 1993) to reduce the necessary 
bandwidth of data transfer during force calculation. 
In this architecture, one pipeline acts as multiple 
pipelines operating at a slower speed. The G5 chip 
has six virtual pipelines per real pipeline unit and 
12 virtual pipelines in total. Each real pipeline unit 
calculates the forces exerted on 6 different particles. 
Data for particle j is used for 6 clock cycles. The 
memory unit operates at a clock frequency 1/6 of that 
of the G5 chip. This architecture simplifies the board 
design. 

Figure 1^ shows the I/O specification of the G5 chip. 
It has four input ports for particle data (XJ[31:0], 
YJ[31:0], ZJ[31:0], and MJ[16:0]), one input and one 
output port to the host computer (DATAI[31:0] and 
DATAO[31:0]), one address bus (ADR[10:0]), six con- 
trol input pins (CLK, SYSCLK, RUN, CS, OE and 
WE), and 13 output pins (NB[11:0] and BUSY). The 
CLK supplies clock signal for chip internal operations. 
The SYSCLK supplies clock signal for I/O operations. 
The frequency of SYSCLK is 1/6 of that of CLK. 
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Fig. 8. — I/O specification of the G5 chip. 



Function Generator 



,2 16 _ 16 



Entry 
Generator 


8 


Table 
(Force) 


/ 





Table 
(Potential) 



12 
-V— 



13 
-V— 



cull/) -i- 



13 
-V— 



1 

Non Zero Bit 



Fig. 7. — Block diagram of the function generator of 
the G5 chip. 



3.2.3. Number format 

Following the design of GRAPE-1 (Ito et al. 1990) 
and GRAPE-3, we adopted word length shorter than 
those used on conventional computers for the G5 chip. 
The word length directly determines the number of 
transistors. In order to achieve high performance at 
low cost, it is crucial to use the minimum word length 
that ensures the validity of the calculation. The word 
length used in the G5 chip are shown in figure ^ and 
figure I?! The number of bits is attached to each data 
line. 

We adopted the logarithmic format for most of the 
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Fig. 6. — Block diagram of the pipeline unit of the G5 chip. 



operations in the pipeline unit except for the subtrac- 
tion of the position vectors, lookup of the cutoff ta- 
ble, and the final accumulation of the force. We prefer 
the logarithmic format over the fixed-point format be- 
cause it has larger dynamic range for the same word 
length. Of course, the usual floating-point format also 
can achieve a wider dynamic range. We chose the log- 
arithmic format because operations such as multipli- 
cation and square root are easier to implement in the 
logarithmic format than in the floating-point format. 
Although the addition becomes complex, the loga- 
rithmic format is more efficient for the word length 
we used for GRAPE-5. 

In the logarithmic format, a positive, non-zero real 
ntimber x is represented by its base-2 logarithm y as 



2^. 



(5) 



In G5 chip, we use 15 bits to express y in unsigned 
fixed-point format, with 8 bits below binary point. 
The 16th bit indicates whether a; is or not (non- 
zero bit), and the 17th bit indicates the sign. Thus 
the total number of bits per word is 17. This format 
can express real number in the range of [1, 2^^^), and 
its resolution is 2^/256 _ i ~ 0.003. 

We use 32-bit fixed-point 2's-complement format 
for each component of the position vector rt and rj, 



in order to guarantee uniform resolution and to sim- 
plify implementation of the periodic boundary con- 
dition. The selection of the minimum image is per- 
formed automatically, by setting the size of the box 
length to the maximum expressible number (2"^^ — 1). 
This format gives a spatial resolution of 2^'^^ ~ 10^^. 

We use 64-bit and 50-bit fixed-point format for ac- 
cumulation of the force and potential, respectively. 
We adopt the fixed-point format because the adder 
(accumulator) is simpler and its cost is lower in this 
format than that in logarithmic or floating-point for- 
mat. 

For conversion between the fixed-point format and 
the logarithmic format, we use format converters de- 
scribed in section 3.2.4 and 3.2.7. For addition in 



the logarithmic format, we use logarithmic adder de- 



scribed in section 3.2.5 



3.2.4- Format converter (fixed point to logarithmic) 

Here we describe the hardware to convert the fixed- 
point format to the logarithmic format. Output num- 
ber has /3 = 2 + 7 + (5 bits, where 7 and S are the 
number of bits above and below the binary point, re- 
spectively. In the case of the G5 chip, /3 = 17, 7 = 7 
and S — 8. 

Figure || shows the block diagram of the format 
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converter. First we convert the 2's complement for- 
mat to the sign+magnitude format. Then we calcu- 
late the "integer" part of logarithm (upper 7 bits) 
using a priority encoder. 
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Fig. 9. — Block diagram of the format converter 
(fixed-point format to logarithmic format). 

The fractional part of logarithm is calculated as 
follows. We first normalize the magnitude of x by re- 
moving leading zeros. This is done by a logical shifter 
with control input from the priority encoder. The 
output of the shifter is then supplied to a table which 
convert a normalized number to the base-2 logarithm. 
In case of G5 chip, the table has 512 (= 2^) entries to 
ensure that the round-off error generated at conver- 
sion is small. 

3.2.5. Logarithmic adder 

The (unsigned-)logarithmic adder performs addi- 
tion of two positive number X and Y in logarithmic 
format. The design of logarithmic adder of G5 chip 
is basically the same as that of GRAPE chip. We 
designed it using the following relation 



X 



Z = log2(2 
= log2 2^ 



2^) 

,2^ ^log2(l + 2V2 
X + log2(l + 2^-^) 



(6) 



for X >Y, or, 

Z = max(X,F) + log2(l + 2-l^-^') (7) 

for general X and Y (Kingsbury and Rayner 1971, 
Swartzlander and Alexopoulos 1975). 

Figure ^ shows the block diagram of the logarith- 
mic adder. First we calculate X — Y and Y — X, and 
choose the positive one using a multiplexer. Then we 
use table lookup to obtain log2(l -I- 2~l"^~^l) from 
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Table 
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Fig. 10. — Block diagram of the logarithmic adder. 

\X — Y\. Finally we obtain Z by adding the output 
of the table to the larger one of X and Y. 

In practice, we do not need to prepare table for all 
possible values of \X - Y\. If log2(l + 2-^^-^^) < 
2^/^ — 1, the result of addition Z is equal to X (as- 
suming X > Y), after we properly rounded the result. 
Thus, in the case of the G5 chip with S — 8, we need 
table only for \X - Y\ < - log2[2(2'"'-i) - 1] ~ 9.1 
(see Makino and Taiji 1998, section 4.6 for detailed 
discussion). 

3.2.6. Cutoff function generator 

Figure shows the block diagram of the cutoff 
function generator. The cutoff function g{rs.ij /ri) and 
9(ti{i^s,ij /v) ^re calculated from r'^.^j and 77^. First we 
divide by 77^ using subtracter and input the result 
to the entry generator. Then wc supply the output of 
the entry generator to the cutoff function tables, and 
obtain 1/g and I/50 as output of those tables. The 
contents of the cutoff function tables are set by the 
user before the force calculation starts. 

The cutoff function table is realized by an on-chip 
RAM. The RAM table consumes significantly larger 
number of transistors per bit than ROM tables used 
in the format converter and the logarithmic adder. In 
order to achieve a good cost performance, it is impor- 
tant to keep the size of the table as small as possible. 

We reduce the size of the table by taking advan- 
tage of the nature of the cutoff function. The cutoff 
function g converges to 1 when rg^ij/rj approaches to 
0, and converges to when rs^ij/r] 1 (see figure 
[Tl] ). Therefore we need high resolution only when 
i's,ijl'n ~ 1. Using these characteristics, we can re- 
duce the number of entries to the cutoff table for small 
Ts^ijh 0.1) and large rs,ij/r] (^ 2). 
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.0001 



Fig. 11. — The cutoff function for P^M method, 
5P3M, given by equation (||). 



The entry generator reduces the number of entries 
to the cutoff function table in the following two steps. 
At the first step, the input r'^ij/rj'^ expressed in 15-bit 
logarithmic format (without sign and non-zero bit) is 
converted to a 9-bit integer number. The conversion 
is expressed as 



(8) 



where A is the logarithmic part of the input number 
interpreted as unsigned integer, and B is the output. 
This conversion reduces the number of entries at small 
rsAj/ij. At the second step, we reduce the entries 
at large rs^ij /f] using the conversion given in table ^. 
An 8-bit integer number is obtained as the conversion 
result. The maximum rs^ij/r] that is expressible in 
the final format is 3.0. This value is large enough for 
typical cutoffs such as gpsM given by equation (0) and 



a Gaussian cutoff, since the value of cutoff at ' 



3.0 is smaller than the force calculation error. In the 
actual implementation, the entry generator directly 
converts the input to the final format by single table 
lookup. 

In order to reduce the size of the RAM table, we 
should reduce not only the number of entries but also 
the word length. The logarithmic format we used has 
17 bits. We do not need sign and non-zero bits for 
g. In addition, we do not need the full 7-bit integer 
part, since g smaller than 1/256 can be treated as 
zero without affecting the overall accuracy. So the 
minimum number of bits above binary point is three. 
We choose to assign 4 bits above the binary point. 



Table 1: Entry reduction procedure for cutoff function 
9 





input 


output 






{B in decimal) 


(in binary) 




0.0 - 1.5 


- 191 


B[7] 


B[0] 


1.5 - 2.0 


192 - 255 


1 1 B[5] - 


B[l] 


2.0 - 3.0 


256 - 383 


1 1 1 B[6] - 


B[2] 


3.0 - 


384 - 


111111 


1 1 



The table actually stores 1/g instead of g, since the 
format can only express numbers not smaller than 1. 

3.2.7. Format converter (logarithmic to fixed point) 

Figure |l^ shows the block diagram of the circuit 
to convert the logarithmic format to the fixed-point 
format. First we convert the fractional part of the log- 
arithmic format to a normalized number (exponential 
of the input) by table lookup. Then we shift it accord- 
ing to the integer part of the logarithmic format. 
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Fig. 12. — Block diagram of the format converter 
(logarithmic format to fixed-point format). 



3.2.8. Packaging and other miscellaneous aspects 

The G5 chip is fabricated by NEC Corporation us- 
ing 0.5 /im gate array process (CM0S-8L family). It 
consists of ^ 200 K gates (nominal 300 K gates with 
65 % usage) and is packaged in a ceramic pin-grid 
array with 364 pins. It operates at 80MIIz clock cy- 
cle. The power supply voltage is 3.3 V and the power 
consumption is about 10 W. We designed the G5 chip 
using logic synthesis tool (AutoLogic II by Mentor 
Graphics Corp.). All design is expressed in the VHDL 
language. 
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3.3. Memory Unit 

The memory unit supplies position vectors r j and 
masses rrij of the particles to the G5 chip according to 
the indices supplied by the particle index unit. The 
particle data Tj and rrij are shared by all G5 chips. 
The memory unit is composed of four 4 Mbit (128 
Kword X 32 bit) synchronous SRAM chips. Three 
are for position vectors and one is for masses. This 
unit can store up to 131072 particles. To simulate 
a system with more than 131072 particles, we need 
to divide the particles into subgroups each of which 
includes less than 131072 particles. We can calculate 
the total forces by summing up the partial forces from 
each subgroup. 

3.4. Particle Index Unit 

The particle index unit supplies particle indices 
(memory address) to the memory unit. This unit 
is optimized for the cell-index method (also known 
as the linked-list method, Quentrec and Brot 1975, 
Hockney and Eastwood 1981). We adopt the hard- 
ware design used in MD-GRAPE (Fukushige et al. 
1996). 

The cell-index method is a scheme to reduce the 
calculation cost of the short-range force. In this 
method, the cube which includes entire simulation 
space (simulation cube) is divided into cells where 
M is the number of cells in one dimension. Usually, 
we set M to the largest integer not exceeding L/rcut, 
where rcut is the cutoff length of the short-range force 
and L is the side length of the simulation cube. In 
order to calculate the forces on particles in a cell, 
we need to calculate the contributions only from the 
particles in 27 neighbor cells. Therefore, the calcula- 
tion cost is reduced by a factor of 27 /M^. One could 
also further reduce the calculation cost by reducing 
the cell size and calculating the contribution from all 
cells whose distance from the cell in question is not 
exceeding rcut- 

Figure |l^ shows the block diagram of the particle 
index unit. It consists of the cell-index memory and 
two counters: the cell counter and the particle index 
counter. When we store particles to the memory unit, 
we rearrange the order of particles so that particles 
in the same cell occupy consecutive locations. Thus, 
we can specify all particles in one cell by its start and 
end addresses. 

The cell counter is a 7-bit counter, and the particle 
index counter is a 17-bit counter. These two counters 
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Fig. 13. — Block diagram of the particle index unit 
of the GRAPE-5 processor board. 



and cell-index memory are packaged in an EPF10K30 
FPGA device (Field Programmable Gate Array, Al- 
tera Corp.), together with the interface unit. 

3.5. Neighbor List Unit 

The neighbor list unit stores the list of the nearest 
neighbors for particle i. Particle j is the neighbor of 
particle i if Vs^ij < hi . One neighbor list unit handles 
the neighbors of particles calculated on four G5 chips. 
So we have two neighbor list units on one board. Fig- 
ure |l^ shows the block diagram of the neighbor list 
unit. Each of four G5 chips outputs neighbor flags for 
12 particles at every clock cycle of the board. Thus 
the neighbor list unit receives the flags for 48 par- 
ticles. These neighbor flags are stored together with 
the corresponding particle index j in the neighbor list 
memory if any of them is asserted. The host com- 
puter reads the data from the neighbor list memory 
after the force calculation is finished. 
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Fig. 14. — Block diagram of the neighbor list unit of 
the GRAPE-5 processor board. 

The neighbor list memory is composed of two 1 
Mbit (32 Kword x 32 bit) synchronous SRAMs. It 
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can hold up to 32768 neighbors for 48 particles. The 
memory stores 48 flag bits and lower 16 bits of parti- 
cle index. The MSB of the particle index is stored in 
the FPGA chip which implements other functions of 
neighbor list as well. The MSB of the particle index 
changes from zero to one only once, since the particle 
index always increases. Therefore, we need to store 
only the value of the internal address counter of the 
neighbor list at which MSB of the index counter is 
first set. Using the value of this register, we can re- 
cover the MSB of particle index on the host computer. 

The logic for the neighbor list unit other than the 
neighbor list memory is implemented in an EPF10K30 
FPGA device. 

3.6. Interface Unit 

The interface unit controls communication between 
the host computer and the GRAPE-5 system. It 
recognizes the following five commands: (1) receive 
data for particle j; (2) receive data for particle i; (3) 
start the force calculation; (4) send back the calcu- 
lated force /j and potential (f>i; (5) send back the 
neighbor list. The control logic is implemented in an 
EPF10K30 FPGA device, together with the particle 
index unit. 

3.7. Peak Performance and Other Miscella- 
neous Aspects 

The peak performance of a GRAPE-5 board is 38.4 
Gflops. The G5 chip calculates 1.6 x 10* interactions 
per second. It calculates two pairwise interactions in 
each clock cycle, and operates at a clock cycle of 80 
MHz. If we count the calculation of the gravitational 
force as 30 floating-point operations, the peak per- 
formance of the G5 chip is equivalent to 4.8 Gflops. 
Thus the peak speed of a board with eight G5 chips is 
38.4 Gflops. The sustained performance is discussed 
in section |5[ 

Figure |5| shows photograph of the GRAPE-5 pro- 
cessor board. The size of the board is 275 mm x 
367 mm. Power supply voltage is 3.3 V and the total 
power consumption is about 70 W. 

We started designing the G5 chip in 1996 May. The 
chip was completed in 1998 June. The first prototype 
of GRAPE-5 board with four G5 chips was completed 
in 1998 October. The production version of the pro- 
cessor board was completed in 1999 April. 



4. Accuracy of Force Calculation 

In this section we discuss the calculati on a ccuracy 
of the force from one particle. In section 4.1, we dis- 
cuss the a ccur acy of the 1 /r'^ gravity with softening. 
In section 4.2, we discuss the accuracy of the force 
with cutoff. 

4.1. Accuracy of 1/r^ Force 

A detailed analysis of the error propagation process 
(Makino et al. 1990) gives an estimate of the relative 
error of the force from one particle as 



Sr{f) = 



< 




12 

r - ■ 

Qj.2 



If e <^ r, the inequality becomes 

12 2 21 2 

7^^^ + Y^/- 



15 



Sr{f) <, 



(9) 
(10) 

(11) 



Here / and / are exact and calculated forces from a 
particle at distance r. The parameter is the r.m.s. 
absolute error of the fixed-point format used for the 
position vectors {ri and rj), and e/ is the r.m.s. rela- 
tive error of the logarithmic format used for internal 
number expression. The first term of the right hand 
side of inequality ( pT| ) is due to the round-off error of 
Ti and Tj expressed in fixed-point format. The sec- 
ond term is due to the error of internal calculation in 
logarithmic format. 



As we described in section 3.2.S, we used 32-bit 



fixed-point format for position data and 17-bit (8 bits 
for fractional part) logarithmic format for internal 
number expression. Therefore, and e/ are expressed 
as 

6.7 X 10-"r„,ax 



233\/3 



and 



In 2 



7.8 X 10" 



(12) 



(13) 



Here r^ax 
Note that r 



512^/3 

is the maximum value of the coordinates, 
can be specified by software. 
Figure |l^ shows the theoretical and measured val- 
ues of the error SrifY^^ as functions of r. The "ex- 
act" force / is calculated using IEEE-754 standard 
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Fig. 15.~ 



Photograph of the GRAPE-5 processor board. 



64-bit arithmetic, and the force / is obtained from 
the software emulator that gives the same results as 
G5 chip at bit level. The softening parameter is set to 
e = 10~^rniax- The mass is set to to = mmin, where 
TOmin is the Smallest mass with which the force from a 
particle located at far most distance (rmax) does not 
underflow. The theoretical estimate and measured 
errors show a good agreement. For r in the range 
of [10~^rmaxj ''max]j the relative force error is around 

0. 3%. We also plot theoretical estimate of error of 
GRAPE-3 in Figure We can see that the G5 chip 
has 10 times higher accuracy and the dynamic range 
10^ times wider than those of the GRAPE Chip. 

In the case of calculation with an extremely small 
softening, the force from close particle can overflow, 

1. e., the force can be too large to be expressed in the 
format we adopted for force representation. Since the 
maximum force from one particle scales as m/e^, the 
minimum softening parameter eo(TO) with which the 
force does not overflow is expressed as 

/ ^ \ 1/2 
eo(TO) = I I emin, (14) 

V"^min/ 

where emin — lO^^r-max- Figure p7| shows such an ex- 
treme case. The force error for a small softening (e = 
10~^rmax — 10~^emin) IS plotted. We can see that the 
force overflows at r ~ 10~'^rniax- 

We can avoid overflow of the force using masses 




-8 -6 -4 -2 
log(r/7-max) 



Fig. 16. — Estimated error of the force from one 
particle in the case of pure 1/r^ force with soften- 
ing, as functions of distance r scaled with rmax- The 
solid and dashed curves are theoretical estimate for 
GRAPE-5 and GRAPE-3, respectively. The crosses 
are errors obtained with GRAPE-5. The softening 
parameter is e = lO^^rmax- Values are averaged over 
1000 trials. 



11 



Co 



.1 



.01 



.001 



\ 

\ 



eAmax=10 




-6 

log{r/r„ 







Co 



.1 



.01 



.001 



e/r„ 



10 



"T-/^min=10 



-6 -4 -2 







Fig. 17. — Same as figure [ij but for smaller softening. 
Crosses are measured errors for softening parameter 



e = 10^ 



and mass : 



The dashed curve 



is the theoretical estimate. 



smaller than mmin. Equation ( p^ ) implies that eo('7i) 
is smaller than emin, if 'ti is smaller than TOmin- In this 
case, however, the force underflows at large r, i.e., the 
force from distant particle is so small that a part of 
its lower bits are lost. The maximum distance ro{m) 
with which the force does not underflow is expressed 
as 

/ m \ 

ro{m) = ( I Tmax- (15) 

Figure |l^ shows the force error for softening smaller 



than Cmin (e 
smaller than mmin 



10 "rmax — 10 ^emin) and mass 
(to — 10~^TOi„in). We can see that 



the force does not overflow even with softening smaller 
than Cniin, although it underflows at r > 10~^rniax- 
This technique to use masses smaller than TO,nin is 
particularly useful with the tree algorithm. In the tree 
algorithm, distant nodes typically represent many 
particles, while close nodes usually represent fewer 
number of particles. Thus, both the overflow and un- 
derflow are automatically avoided. 

4.2. Accuracy of Force v^rith Cutoff 

First we estimate the accuracy of the cutoff func- 
tion table. The error of the table output is estimated 
as 

\g-g\<9'Q+gCf, (16) 

where g is the exact value of arbitrary cutoff function, 
g is the output of the table, and g' is the derivative of 
g. The parameter Q is the absolute error of the table 



Fig. 18. — Same as flgure [T7| but for smaller mass 
TO = IQ-^mmm- 



entry for the worst case, and Q is the relative error 
of the table output for the worst case. The flrst term 
of the right hand side of inequality (^6|) is due to the 
round-off error of the table entry. The second term is 
due to the round-off error of the table output. 

The resolution of the entry of the table depends on 
the magnitude of the entry itself, as shown in table 
^. The table entry ^5 ^/77 has the finest resolution 
(1/128) in the range of [0.0, 1.5], and the resolution 
doubles at rsAj/rj — 1.5 and rs.ij/rj — 2.0. Therefore 



Ci is given by 



P 

256' 



where 



1 for 0.0 < rs.i]/i] < 1.5 

2 for 1.5 < rsAj/v < 2.0 
4 for 2.0 < rs,jj7?/ < 3.0 
00 for 3.0 < rs^ij/rj. 



(17) 



(18) 



The table output is expressed in logarithmic format 
with 8-bit fractional part, and thus is given by 



In 2 

Cf = ~ 1.4 X 10" 

512 



(19) 



Figure |l^ shows the theoretical and measured val- 
ues of the error as functions of rg^ij/rj. We used the 
cutoff function for P'^M method, gpsMj given by equa- 
tion (H). The theoretical estimate and measured er- 
rors show a good agreement. The error increases dis- 
continuously at rs.ij/r] =1.5 because the resolution 
of the table entry degrades. Even though, the magni- 
tude of the error is not more than 0.5% in all range. 
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expressed as 



Fig. 19. — Error of the cutoff function g calculated 
by the cutoff function table. Solid curve denotes mea- 
sured error. Dashed curve is an estimate of the error 
for the worst case. As the function g we used gpsM 
for P^M method given by equation (H). 

In the following we estimate the error of the force 
with cutoff. We define the error Sr{fc) as 



Srifc) ^ 



fc 



P 



(20) 



where f ^ and f ^ are exact and calculated forces with 
cutoff, and / is a force without cutoff. We defined the 
error relative to the force without cutoff, since that is 
what affects the overall accuracy. 

Following a similar procedure as that for the 1/r^ 
force, the force calculation error is given by 



Sr{f) < 



+ 



18P 


12 


7-4.. 




















2 




If e ^ r, the inequality becomes 



2 

2 



+ 



23 

y 



5_r 
9 V 



(21) 



(22) 



(23) 



where £i is the r.m.s. absolute error of the format 
used for the entry of the cutoff function table, and is 



25673' 



(24) 



Figure gffl shows the theoretical and the measured 
error S{fY''^ for the force with cutoff gpsM- The 
softening parameter is set to e = 10~^rniax- The scale 
length is set to 77 = 1/8. We can see that the error 
of the force with cutoff is no more than 0.4% for r in 
the range of [10~^rniax, ''max] • Note that the measured 
error is systematically better than the theoretical es- 
timate of equation ( ^l| ) for 10~^ri„ax < r < 10~^rinax- 
This is simply because in this range g is very close to 
unity. The output of the table is exactly one. Thus, 
the round-off error of the cutoff table is smaller than 
the estimate of it. 



1 



theoretical 

X measured 



s 

Co 



.01 



.001 - 




-6 -4 
log(r/7-n 



Fig. 20. — Same as figure [l^, but for a force with cut- 
off 5P3M for P-^M method given by equation(||). The 
softening parameter e and the scale length of the cut- 
off function 77 are 10~^rmax and rmax/8, respectively. 



5. Timing Results 

Here we give timing results of the GRAPE-5 sys- 
tem for direct and tree algorithms. For P^M method, 
we give theoretical estimate only. We used one pro- 
cessor board and a host computer COMPAQ AlphaS- 
tation XPIOOO (Alpha 21264 processor 500 MHz). 
For comparison, we also give theoretical estimate for 
GRAPE-3 system. As the host computer of GRAPE- 
3 system, we use SUN Uhra 2/200 (UltraSPARC pro- 
cessor 200 MHz). 
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5.1. Direct Summation Algorithm 

The total calculation time per timestep is ex- 
pressed as 



Tho 



grapel 



■T,, 



(25) 



where Thosti, T'grapei, and Tcommi are the time spent 
on the host computer, the time spent on GRAPE- 
5, and the time spent for data transfer between the 
host computer and GRAPE-5, respectively. The time 
spent on the host computer is expressed as 



T, 



hostl 



Nt„ 



(26) 



where tmisci represents calculation time spent on the 
host computer per particle, which includes time inte- 
gration of the particles, diagnostics, and other miscel- 
laneous 0{N) contributions. The value of tmisci and 
other timing constants used for theoretical estimates 
in this section are summarized in table |[ 

Table 2: Timing constants used for performance esti- 
mation 



parameter 



GRAPE- 

(s) 



GRAPE-3 

(s) 



^miscl 


1.1 


X 


10- 


(j 


5.4 


X 


10- 


6 


^pipc 


1.25 


X 


10 


-8 


5.0 


X 


10- 


8 


^comm,z 


2.1 


X 


10- 


8 


3.3 


X 


10- 


7 


^comm.j 


3.3 


X 


10- 


8 


3.3 


X 


10- 


7 


^comm,/ 


3.6 


X 


10- 


8 


3.3 


X 


10" 


7 


^con 


4.2 


X 


10- 


7 


2.0 


X 


10- 


6 


^list 


5.9 


X 


10- 


7 


2.8 


X 


10- 


6 


^misc2 


8.5 


X 


10- 


7 


4.0 


X 


10- 


6 




8.5 


X 


10- 


8 










^misc3 


5.2 


X 


10- 


6 











The time spent on GRAPE-5 is expressed as 
NX 



grapel 



t-pipe 



(27) 



pipe 



Here 



''pipc 



is the number of real pipelines per proces- 



sor board, which equals 16. The number ipipo is the 
cycle time of the G5 chip. 

The time spent for data transfer is expressed as 



+ I2qt. 



comm. t 



32Q^comm,/) 7 

(28) 

are the time to transfer 
one byte data from the host computer to GRAPE-5 



where tcommj- and 



for particle j and particle i, respectively. The transfer 
speed for particle j is faster than that for particle i, 
since the size of the data transferred in one transac- 
tion is larger for particle j. Large data implies small 
overhead and fast overall speed. The constant icommj 
is the time to transfer one byte data from GRAPE- 
5 to the host computer for calculated force, and q is 
given by 

" iV- 1 



1 



(29) 



where \_x\ denotes the maximum integer which does 
not exceed x, and nmem(= 131072) is the maximum 
number of particles which can be stored in the mem- 
ory unit of a GRAPE-5 processor board. This q indi- 
cates how the total force on a particle is divided. If 
N > rimcm, the force on a particle must be divided 
into q pieces, and GRAPE-5 must be used q times to 
obtain the total force. 

Figure ^ shows measured and estimated calcula- 
tion time per one timestep as functions of N. Figure 
p2| shows measured and estimated speed as functions 
of the number of particle N. For GRAPE-5 system, 
we chose timing constant of the host computer, ^hosti 
to fit the measured results. For GRAPE-3 system, 
we used thosti scaled from the one for GRAPE-5 sys- 
tem, according to the ratio of SPECfp95 values of the 
hosts. For the data transfer times, we used measured 
values for GRAPE-5 system, and the value given in 
Athanassoula et al. (1998) for GRAPE-3 system. 

In figure |2^, we can see that the calculation with 
GRAPE-5 is about 8 times faster than that with 
GRAPE-3, for the entire range of N we used. In 
figure ^ we can see that the effective performance 
of GRAPE-5 exceeds 70 % of the peak performance, 
for rather small N {— 10^). 

5.2. Barnes-Hut Tree Algorithm 

The total calculation time per timestep is ex- 
pressed as 



TtYcc -^host2 '^grapc2 ^" -^comm2; 



(30) 



where rhost2, rgrapc2, and rcomm2 are the time spent 
on the host computer, the time spent on GRAPE- 
5, and the time spent for data transfer between the 
host computer and GRAPE-5, respectively. The time 
spent on the host computer is expressed as 

Nnt^ 



rhost2 = (^logioiV)t con ~r 



(31) 
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Fig. 21. — Calculation time per one timestep for di- 
rect summation algorithm, plotted as functions of the 
number of particles N. The solid and dashed curves 
are theoretical estimate for GRAPE-5 and GRAPE-3, 
respectively. The open circles represent the measured 
performance of GRAPE-5. 
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Fig. 22. — Calculation speed for direct summation 
algorithm, plotted as a function of the number of par- 
ticles N. The solid and dashed curves are theoretical 
estimate for GRAPE-5 and GRAPE-3, respectively. 
The open circles represent the measured performance 
of GRAPE-5. The dotted lines denote peak speed of 
GRAPE-5 and GRAPE-3. 



where icon is the time to construct the tree struc- 
ture, tiist is the time to construct the interaction lists, 
and tinisc2 represents 0{N) miscellaneous contribu- 
tions (Fukushige et al. 1991). In this equation, ntorms 
is the average length of the interaction list. According 
to Makino (1991), nterms is estimated as follows: 



rig + Unl^^ + Mnl^^ + 56 log 
-316l"3logio ng - 72 



-lO^e-^ log 



10 ■ 



N9^ 
23"' 



(32) 



Here, 9 is the opening angle. 

The time spent on GRAPE-5 is expressed as 



grapc2 



tcrms^'pipc 



n 



(33) 



pipe 



and the time spent for data transfer is expressed as 



comm2 



N 



16 tc 



(34) 

Figure |2^ shows the calculation time per one timestep. 
Estimates for GRAPE-3 are also shown. Table || 
shows the calculation time spent on GRAPE-5, on 
the host computer, and for data transfer for the num- 
ber of particle A^= 1048576. As the initial particle 
distribution, we use a Plummer model. All parti- 
cles have equal mass. The opening angle 6 is 0.75. 
For GRAPE-5, Ug ~ 2000. For GRAPE-3, we used 
Ug ~ 1000. The timing constants for the host com- 
puter and data transfer are chosen in the same way 



as in section 5.1 



In figure |2^, we can see that the calculation with 
GRAPE-5 is about 6 times faster than that with 
GRAPE-3, and that the tree algorithm is faster than 
the direct summation algorithm for N ^ lO''. Table || 
indicates that we can use up to two or three processor 
boards to increase the calculation speed further. In 
order to use more than four boards effectively, we need 
a host computer which has faster computing speed 
and multiple communication buses, so that the calcu- 
lation on the host computer and the data transfer do 
not limit the total performance. 

The calculation time with single processor board 
for a million particle simulation is 16 seconds per 
timestep. To put this number into perspective, paral- 
lel treecode by Dubinski (1996) would took around 60 
seconds for one million particles on 64 processor T3D, 
with 6 = 1.2. Taking into account the difference in 
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Fig. 23. — Same as figure |l|, but for the Barnes-Hut 
tree algorithm. 



0, it is perhaps fair to say our treecode on GRAPE-5 
is about 6 times faster than Dubinski's parallel code 
on 64 processor T3D (his code does not scale well 
for more than 128 processors). Yahagi et al. (1999) 
described an implementation of treecode on Fujitsu 
VPP300/16R, which took ^ 7 seconds for one million 
particles, with the opening criterion roughly corre- 
sponding to 9 — 1. Again taking into account the dif- 
ference in 9, the effective speed of our code is ~ 70% 
of VPP300/16R. 

5.3. P^M Method 

The total calculation time per timestep is ex- 
pressed as 



7p3m — 7host3 + 7f 



grapc3 



(35) 



where Thosts, Tgrapos, and Tcomms are the time spent 
on the host computer, the time spent on GRAPE- 
5, and the time spent for data transfer between the 
host computer and GRAPE-5, respectively. In the 
following we present the estimate for the case of ho- 
mogeneous distribution of particles. 

The time spent on the host computer is expressed 

as 



7host3 = 3Mp,n l0g2 Mpmifft + Nt 



misc3 • 



(36) 



Table 3: Calculation time per one timestep 
for Barnes-Hut tree algorithm (Plummer model, 
iV=1048576, 6'=0.75) 





GRAPE-5 

(b) 


Estimate for 
GRAPE-3 
(s) 


Host computer 






Tree construction 


2.8 


13.0 


Interaction list construction 


2.0 


12.7 


Miscellaneous 


1.3 


4.4 


GRAPE 


6.9 


28.4 


Data transfer 


3.1 


40.3 


Total 


16.1 


98.8 



The first term represents the time for EFT and the 
second term represents the time for 0{N) miscel- 
laneous calculations. Here Mpm is the number of 
meshes in one-dimension used in PM force calcula- 
tion, and iinisc3 represents time for miscellaneous cal- 
culation per particle. 

The time spent on GRAPE-5 is expressed as 



apc3 



27jV^tpipe 



(37) 



where Mpp is the number of meshes for PP force. We 
set the ratio Mpm/Mpp to 2.9, following Brieu et al. 
(1995) and Fukushige et al. (1996). We set Mp^ so as 
to minimize the total calculation time. The optimal 
value of Mpui is given approximately by 



pin 



11, 



-N. 



(38) 



^pipc^fft 

The time spent for data transfer is expressed as 

comm3 — -^(l^tcommj H" l^icomm,?' ~("32^comm,/ ) ■ (^9) 



16 



Figure shows the calculation time per one timestep. 
Theoretical estimate for GRAPE-5 is plotted. As 
the timing constants for the host computer, tgt and 
^miscs, we used values scaled from the ones given in 
Fukushige et al. 1996, according to the SPECfp95 
values of the hosts. 
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Fig. 24. — Same as figure |l|, but for the P-'^M 
method. Actual measurement is not available. 

The estimated calculation speed is 40 and 25 times 
faster than that with GRAPE-3 (Brieu et al. 1995) 
and MD-GRAPE (Fukushige et al. 1996), respec- 
tively. The calculation time with single processor 
board for 16 million particles simulation is 150 sec- 
onds per timestep. As discussed by Brieu et al. 
(1995), this calculation time depends only weakly on 
the degree of the clustering. This is simply because 
the calculation cost of 0{N) part of the code (mass 
assignment, particle update, etc.) dominates the total 
cost. Even at a highly clustered stage, the increase in 
the calculation cost would be a factor of two or so. On 
the other hand, CPU time of parallel P^M implemen- 
tation is very sensitive to clustering. For example, 
MacFarland et al. (1998) reported the CPU time per 
timestep increased from ^ 10 seconds to > 100 sec- 
onds for their 16 million particle simulation on 128 
processor T3E-600. Thus, depending on the degree 
of clustering, P'^M on GRAPE-5 runs at the speed of 
5-50 % of a 128 processor T3E-600. 

6. Future Prospects 

6.1. Massively-Parallel GRAPE-5 

We plan to build a massively-parallel GRAPE-5 
system with peak performance of about 1 Tflops. 



This system will consist of about 20 processor boards 
and a host computer. Each processor board is con- 
nected to a PHIB which is inserted into one PCI slot. 
The host computer has multiple processors and mul- 
tiple PCI slots. 

In order for 1 Tflops-peak GRAPE-5 system to op- 
erate efficiently with the tree algorithm, a host com- 
puter should have the effective computing speed and 
communication speed of about 5 Gflops and a few 
hundreds MB/s, respectively. Currently, these per- 
formance figures are offered only by machines with 
multiple processors. 

We consider two types of host computers. One 
is a shared-memory multiprocessors (SMPs) , and the 
other is a workstation cluster. 

The advantage of the SMP over the workstation 
cluster is that the implementation of the code is rel- 
atively easy. The disadvantages are that the price is 
relatively high and the number of processors is lim- 
ited to 10 or smaller. This limit is due to bottleneck 
in the memory access from multiple processors. 

The advantage of the workstation cluster is that 
they are inexpensive and that it is possible to connect 
^ 100 processors. For example. Warren et aZ.(1998) 
performed cosmological A^-body simulation with tree 
algorithm on Avalon, a cluster of 70 DEC Alpha pro- 
cessors (DEC Alpha 21164A, 533 MHz). The disad- 
vantage is that the implementation of the code is more 
difficult. 

We plan to build a system based on a cluster of 4- 
8 workstations. A preliminary analysis indicates that 
lOOBT Ethernet connection offers sufficient commu- 
nication bandwidth. On this system, one timestep of 
10^-body simulation with tree algorithm would take 
about 10 seconds. Using this system, for example, we 
can perform cosmological 10^-body simulation for 10'^ 
steps in one day. 

6.2. GRAPE-5/PROGRAPE System 

We plan to construct a heterogeneous comput- 
ing system, GRAPE-5/PR0GRAPE system. PRO- 
GRAPE (PROgrammable GRAPE; Hamada et al. 
1999) is a multi-purpose computer for many-body 
simulation. It consists of reconfigurable processor im- 
plemented on FPGA (Field Programmable Gate Ar- 
ray) chips and memory which stores particle data. 
PROGRAPE has a very similar architecture to GRAPE 
except that it uses FPGA chips as pipeline processor 
instead of custom LSI chips such as G5 chip. PRO- 
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GRAPE can calculate any interaction which is ex- 
pressed as 

/.=E9(P.'P,)' (40) 



where i and j arc the indices to the two sets of parti- 
cles, p^ is the data of particle i (for example position 
and mass, but can include quantities such as radius, 
pressure, temperature, etc.), and g{Pi,Pj) is an arbi- 
trary function of these particle data. 

Hamada et al. (1999) have developed the first pro- 
totype of PROGRAPE, PROGRAPE-1. It houses 
two FPGA devices (EPFIOKIOO, Altera Corp.), each 
of which has 100k logical gates. They have im- 
plemented gravitational pipelines same as that in 
GRAPE chip, which are used for GRAPE-3. The 
pipelines operates at a clock cycle of 16 MHz and the 
peak performance is 0.96 Gflops. 

One application example of the GRAPE-5/PR0GRAPE 
system is Smoothed Particle Hydrodynamics (SPH; 
Lucy 1977, Monaghan 1985). SPH is widely used in 
gas dynamical simulations in astrophysics. GRAPE- 
5 calculates the gravitational interactions and PRO- 
GRAPE calculates hydrodynamical interactions, which 
includes calculations of density, pressure gradient, 
and artificial viscosity. For the more detail about 
PROGRAPE system, see Hamada et al. (1999). 

Another example is a simulation with Ewald method 
(Ewald 1921). The Ewald method is a method to 
calculate gravitational force under periodic boundary 
condition. In the Ewald method, the force is divided 
into two interactions in real space and in wave num- 
ber space. GRAPE-5 calculates the interaction in real 
space and PROGRAPE calculates the interaction in 
wave-number space. 

It is difficult to estimate the exact performance 
of GRAPE-5/PR0GRAPE system, since the perfor- 
mance of PROGRAPE varies depending on appli- 
cations. But we can expect that the above men- 
tioned simulations on GRAPE-5/PR0GRAPE sys- 
tem would be faster than that on GRAPE-5 system 
without PROGRAPE by at least a factor of 10. 

We would like to thank Daiichiro Sugimoto, who 
have started the GRAPE project. This work was par- 
tially supported by the Research for the Future Pro- 
gram of .lapan Society for the Promotion of Science, 
JSPS-RFTP 97P01102. 
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